02. Q-Learning Algorithm


Summary

Q-Learning is a model-free algorithm, which means it can learn in an environment whose dynamics are not fully known in advance. The value of each state-action pair is estimated from direct observations of that environment. More specifically, Q-Learning is a TD, or Temporal Difference, learning approach, because state transitions are learned under the assumption that they are sequential, or time-based.

Q(s_t, a_t) \leftarrow (1 - \alpha) \cdot \underbrace{Q(s_t, a_t)}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \overbrace{\Big( \underbrace{r_t}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_a Q(s_{t+1}, a)}_{\text{estimate of optimal future value}} \Big)}^{\text{learned value}}

The Q-Learning algorithm is expressed as an iterative update equation with a learning rate (\alpha) and a discount factor (\gamma). The learning rate is a value between 0 and 1 that determines what portion of the newly learned value is incorporated into the Q-value at each time step. The discount factor is also a value between 0 and 1 and determines how strongly estimated future rewards influence the new Q-value at each time step.
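As a concrete illustration, below is a minimal sketch of this update in Python. It assumes a tabular Q-function stored as a NumPy array indexed by (state, action); the names q_update, q_table, and so on are illustrative, not part of the lesson itself.

import numpy as np

def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.9):
    """Apply one Q-Learning update for an observed (s, a, r, s') transition."""
    old_value = q_table[state, action]             # Q(s_t, a_t)
    future_value = np.max(q_table[next_state])     # max_a Q(s_{t+1}, a)
    learned_value = reward + gamma * future_value  # r_t + gamma * max_a Q(s_{t+1}, a)
    # Blend old and learned values according to the learning rate.
    q_table[state, action] = (1 - alpha) * old_value + alpha * learned_value
    return q_table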

As the agent explores the environment and accumulates experience from different state-action pairs, it converges on a policy that selects an action for any given state it observes.
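A full training loop built on the update above might look like the following sketch. It assumes a small discrete environment object with reset() and step(action) methods in the style of OpenAI Gym, where step returns (next_state, reward, done); epsilon-greedy action selection is one common way to balance exploration and exploitation, though the lesson does not prescribe a particular strategy.

def train(env, n_states, n_actions, episodes=500,
          epsilon=0.1, alpha=0.1, gamma=0.9):
    """Learn a Q-table by repeatedly exploring the environment."""
    q_table = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: try a random action with probability
            # epsilon, otherwise exploit the current best estimate.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(q_table[state]))
            next_state, reward, done = env.step(action)
            q_update(q_table, state, action, reward, next_state, alpha, gamma)
            state = next_state
    # The learned greedy policy is: pick argmax_a Q(s, a) in each state.
    return q_table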

Quiz 1: Q-Learning Algorithm

What would be the effect of setting the learning rate to \alpha = 0?

SOLUTION:
  • The new Q-value would just be the old Q-value; nothing would be learned.
  • The learned value would be ignored.
  • The discount factor would have no effect on the new Q-value.
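To see this directly, substitute \alpha = 0 into the update equation; the learned-value term is multiplied by zero and drops out entirely, so \gamma never enters the update:

Q(s_t, a_t) \leftarrow (1 - 0) \cdot Q(s_t, a_t) + 0 \cdot \big( r_t + \gamma \cdot \max_a Q(s_{t+1}, a) \big) = Q(s_t, a_t)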